In an effort to refresh my predictive modeling skills, I entered an ongoing Kaggle Competition where competitors attempt to predict house prices in Ames, Iowa.
I started this competition by just focusing on getting a good understanding of the dataset. The EDA is detailed and many visualizations are included.
Feature engineering was performed on a handful of variables, which greatly improved model performance.
The XGBoost model ended up performing very well with a cross validation RMSE of 0.1177.
Kaggle describes this competition as follows:
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
# Read packages
library(ggplot2)
library(readr)
library(dplyr)
library(ggrepel)
library(scales)
library(knitr)
library(corrplot)
library(plyr)
library(randomForest)
library(gridExtra)
library(psych)
library(caret)
library(xgboost)
test <- read.csv("test.csv", stringsAsFactors = F)
train <- read.csv("train.csv", stringsAsFactors = F)
#Saving IDs in a vector for later
test_labels <- test$Id
test$Id <- NULL
train$Id <- NULL
test$SalePrice <- NA
all <- rbind(train, test)
The dataset consists of characters and integers. Most of the character variables are actually factors, but I read them in as character strings because most of them require cleaning and/or feature engineering.
numericVars <- which(sapply(all, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on
##str(numericVarNames)
##cat('There are', length(numericVars), 'numeric variables')
ggplot(data=all[!is.na(all$SalePrice),], aes(x=SalePrice)) +
geom_histogram(fill="blue", binwidth = 10000) +
scale_x_continuous(breaks= seq(0, 800000, by = 100000), labels = comma)
summary(all$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 34900 129975 163000 180921 214000 755000 1459
SalePrice is very right skewed. This was expected as very few people can afford very expensive houses. This will be kept in mind, and will be dealt with before modeling.
numeric.vars <- which(sapply(all, is.numeric)) #index vector numeric variables
numeric.var.names <- names(numeric.vars) #saving names vector for use later on
all.num.var <- all[, numeric.vars]
cor.num.var <- cor(all.num.var, use="pairwise.complete.obs") #correlations of all numeric variables
cor.sorted <- as.matrix(sort(cor.num.var[,'SalePrice'], decreasing = TRUE)) #sort on decreasing correlations with SalePrice
#select only high corelations
CorHigh <- names(which(apply(cor.sorted, 1, function(x) abs(x)>0.5)))
cor.num.var <- cor.num.var[CorHigh, CorHigh]
corrplot.mixed(cor.num.var, tl.col="black", tl.pos = "lt", number.cex=0.75)
The correlation plot shows that 10 variables have a correlation above 0.5 with SalePrice. We also see that there is some multicollinearity going on. GarageArea has a 0.89 correlation with GarageCars. X1stFlrSF has a 0.80 correlation with TotalBsmtSF. TotRmsAbvGrd has a 0.81 correlation with GrLivArea.
The variable Overall Quality has the highest correlation with SalePrice. Overall quality is defined as “The overall material and finish of the house,” and provides a ranking between 1 through 10.
ggplot(data=all[!is.na(all$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
geom_boxplot(col='blue') + labs(x='Overall Quality') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
As the Overall Quality of the house increases, Sale Price increases.
ggplot(data=all[!is.na(all$SalePrice),], aes(x=GrLivArea, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
## `geom_smooth()` using formula = 'y ~ x'
First, let’s check and see which variables have missing values.
NAcol <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## 2909 2814 2721 2348 1459 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond GarageType
## 486 159 159 159 159 157
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1 MasVnrType
## 82 82 81 80 79 24
## MasVnrArea MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 23 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
cat('There are', length(NAcol), 'columns with missing values')
## There are 35 columns with missing values
The 1459 NAs in SalePrice match the size of the test set perfectly. This means that NA’s need to be fixed in 34 predictor variables.
Here I will take a look at the 34 variables that contain missing values. I will begin with the variables that has the most NA’s and work my way down. If I come across a variable that actually forms a group with other variables, I will deal with them at the same time. For example, Pool, Garage and Basement all have more than one variable relating to the area.
I decided to use knitr’s “Tabs” ability to keep the document more readable. If you don’t want to read every section, the Garage and Basement sections are especially interesting.
Along with fixing NA’s, I have also converted character variables into ordinal integers and factors. I will later convert the factors into numeric variables by creating dummy variables.
Pool Quality and the PoolArea variable
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
NA means No Pool. Easy enough. The high number of NA’s makes sense as not many houses have pools.
all$PoolQC[is.na(all$PoolQC)] <- 'None'
It is also clear that this variable ordinal and can be label encoded. Because there are multiple variables that use the same quality levels of 0-5, I am going to create a vector that I can reuse later on.
Qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
Now, we can use the revalue function to assign a number to each string
all$PoolQC<-as.integer(revalue(all$PoolQC, Qualities))
table(all$PoolQC)
##
## 0 2 4 5
## 2909 2 4 4
Only ten houses have pools.
However, there is a second variable that relates to Pools. This is the PoolArea variable (in square feet). There are 3 houses that have a PoolArea but no PoolQC. First, I checked if there was a clear relation between the PoolArea and the PoolQC. As I did not see a clear relation (bigger or smaller pools with better PoolQC), I am going to impute PoolQC values based on the Overall Quality of the houses (which is not very high for those 3 houses).
all[all$PoolArea>0 & all$PoolQC==0, c('PoolArea', 'PoolQC', 'OverallQual')]
## PoolArea PoolQC OverallQual
## 2421 368 0 4
## 2504 444 0 6
## 2600 561 0 3
all$PoolQC[2421] <- 2
all$PoolQC[2504] <- 3
all$PoolQC[2600] <- 2
Please return to the Tabs menu to view work on other variables
MiscFeature has 2814 NAs.
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
This is a clear case of a factor, where NA means None.
all$MiscFeature[is.na(all$MiscFeature)] <- 'None'
all$MiscFeature <- as.factor(all$MiscFeature)
ggplot(all[!is.na(all$SalePrice),], aes(x=MiscFeature, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
table(all$MiscFeature)
##
## Gar2 None Othr Shed TenC
## 5 2814 4 95 1
Interestingly, properties with large sheds sold for less money than properties with no extra features. Maybe having a shed means the property doesn’t have a garage in many cases? And with there only being one house that has a tennis court, this variable isn’t particularly useful.
Please return to the Tabs menu to view work on other variables
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
Alley is a factor variable. Most houses do not have alleys.
all$Alley[is.na(all$Alley)] <- 'None'
all$Alley <- as.factor(all$Alley)
ggplot(all[!is.na(all$SalePrice),], aes(x=Alley, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
table(all$MiscFeature)
##
## Gar2 None Othr Shed TenC
## 5 2814 4 95 1
Ninety one properties are connected to an Alley, and, as expected, houses on paved alleys are worth more than houses on gravel.
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
This is a true factor fariable as the categories are not ordinal. It seems like this should’ve been two different variables, as two categories could be true at one time. A fence could have both good privacy and good wood.
all$Fence[is.na(all$Fence)] <- 'None'
ggplot(all[!is.na(all$SalePrice),], aes(x=Fence, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
all$Fence <- as.factor(all$Fence)
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
Ordinal factor regarding the quality of the property’s fireplace.
all$FireplaceQu[is.na(all$FireplaceQu)] <- 'None'
all$FireplaceQu<-as.integer(revalue(all$FireplaceQu, Qualities))
table(all$FireplaceQu)
##
## 0 1 2 3 4 5
## 1420 46 74 592 744 43
ggplot(all[!is.na(all$SalePrice),], aes(x=FireplaceQu, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
Having no fireplace is slightly better than having a low quality
fireplace. The properties with the highest quality fireplaces have
incredible value, more than $100,000 higher than the next level of
fireplace quality.
There are three variables related to lots. LotFrontage has NAs and LotShape and LotConfig are complete.
LotFrontage: Linear feet of street connected to property
There are almost 500 NAs. It is unreasonale for a house to have 0 feet of street property, and these values need to be imputed. I am going to impute these values by taking the median value of other houses in the same neighborhood.
ggplot(all[!is.na(all$LotFrontage),], aes(x=as.factor(Neighborhood), y=LotFrontage)) +
geom_bar(stat='summary', fun.y = "median", fill='blue') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "summary", fun.y = "median", fill = "blue"):
## Ignoring unknown parameters: `fun.y`
## No summary function supplied, defaulting to `mean_se()`
for (i in 1:nrow(all)){
if(is.na(all$LotFrontage[i])){
all$LotFrontage[i] <- as.integer(median(all$LotFrontage[all$Neighborhood==all$Neighborhood[i]], na.rm=TRUE))
}
}
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
No NAs. The values seem ordinal.
all$LotShape<-as.integer(revalue(all$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))
table(all$LotShape)
##
## 0 1 2 3
## 16 76 968 1859
sum(table(all$LotShape))
## [1] 2919
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
No NAs. The values are not ordinal so I converted the variable to a factor.
all$LotConfig <- as.factor(all$LotConfig)
table(all$LotConfig)
##
## Corner CulDSac FR2 FR3 Inside
## 511 176 85 14 2133
sum(table(all$LotConfig))
## [1] 2919
There are 7 garage variables: GarageYrBlt, GarageFinish, GarageQual, GarageCond, GarageType, GarageCars, and GarageArea.
Two of those have one NA (GarageCars and GarageArea), one has 157 NAs (GarageType), and the other four have 159 NAs.
I am going to replace all NAs with the year the house was built.
all$GarageYrBlt[is.na(all$GarageYrBlt)] <- all$YearBuilt[is.na(all$GarageYrBlt)]
As NAs mean ‘No Garage’ for character variables, I now want to find out where the differences between the 157 NA GarageType and the other 3 character variables with 159 NAs come from.
#check if all 157 NAs are the same observations among the variables with 157/159 NAs
length(which(is.na(all$GarageType) & is.na(all$GarageFinish) & is.na(all$GarageCond) & is.na(all$GarageQual)))
## [1] 157
#Find the 2 additional NAs
kable(all[!is.na(all$GarageType) & is.na(all$GarageFinish), c('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])
| GarageCars | GarageArea | GarageType | GarageCond | GarageQual | GarageFinish | |
|---|---|---|---|---|---|---|
| 2127 | 1 | 360 | Detchd | NA | NA | NA |
| 2577 | NA | NA | Detchd | NA | NA | NA |
The 157 NAs within GarageType all turn out to be NA in GarageCondition, GarageQuality, and GarageFinish as well. The differences are found in houses 2127 and 2577. As you can see, house 2127 actually does seem to have a Garage and house 2577 does not. Therefore, there should be 158 houses without a Garage. To fix house 2127, I will impute the most common values for GarageCond, GarageQual, and GarageFinish.
#Imputing modes.
all$GarageCond[2127] <- names(sort(-table(all$GarageCond)))[1]
all$GarageQual[2127] <- names(sort(-table(all$GarageQual)))[1]
all$GarageFinish[2127] <- names(sort(-table(all$GarageFinish)))[1]
#display "fixed" house
kable(all[2127, c('GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])
| GarageYrBlt | GarageCars | GarageArea | GarageType | GarageCond | GarageQual | GarageFinish | |
|---|---|---|---|---|---|---|---|
| 2127 | 1910 | 1 | 360 | Detchd | TA | TA | Unf |
GarageCars and GarageArea: The number of cars that can fit in the garage and the area in Square Feet
Both have 1 NA. As you can see above, it is house 2577 for both variables. The problem probably occured as the GarageType for this house is “detached”, while all other Garage-variables seem to indicate that this house has no Garage.
#fixing 3 values for house 2577
all$GarageCars[2577] <- 0
all$GarageArea[2577] <- 0
all$GarageType[2577] <- NA
#check if NAs of the character variables are now all 158
length(which(is.na(all$GarageType) & is.na(all$GarageFinish) & is.na(all$GarageCond) & is.na(all$GarageQual)))
## [1] 158
Now, the 4 character variables related to garage all have the same set of 158 NAs, which correspond to ‘No Garage’. I will fix all of them in the remainder of this section
GarageType: Garage location
The values do not seem ordinal, so I will convert into a factor.
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
all$GarageType[is.na(all$GarageType)] <- 'No Garage'
all$GarageType <- as.factor(all$GarageType)
table(all$GarageType)
##
## 2Types Attchd Basment BuiltIn CarPort Detchd No Garage
## 23 1723 36 186 15 778 158
GarageFinish: Interior finish of the garage
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
These values are ordinal.
all$GarageFinish[is.na(all$GarageFinish)] <- 'NA'
all$GarageFinish<-as.integer(revalue(all$GarageFinish, c('NA'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)))
table(all$GarageFinish)
##
## 0 1 2 3
## 158 1231 811 719
sum(table(all$GarageFinish))
## [1] 2919
ggplot(all[!is.na(all$GarageFinish),], aes(x=GarageFinish, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
Houses with no garages are worth more than houses with unfinished garages.
GarageQual: Garage quality
GarageQual: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
Ordinal.
all$GarageQual[is.na(all$GarageQual)] <- 'None'
all$GarageQual<-as.integer(revalue(all$GarageQual, Qualities))
table(all$GarageQual)
##
## 0 1 2 3 4 5
## 158 5 124 2605 24 3
ggplot(all, aes(x=GarageQual, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
Only three garages received a top rating of Excellent, and the houses
are not very valuable.
GarageCond: Garage condition
GarageCond: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
Ordinal. This variable looks to be almost the same as GarageQual. The data dictionary does not mention any distinction.
all$GarageCond[is.na(all$GarageCond)] <- 'None'
all$GarageCond<-as.integer(revalue(all$GarageCond, Qualities))
table(all$GarageCond)
##
## 0 1 2 3 4 5
## 158 14 74 2655 15 3
ggplot(all, aes(x=GarageCond, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
There are 11 variables that related to the basement of the house. Five of them have 79-82 NAs, and six have one NA each.
#check if all 79 NAs are the same observations among the variables with 80+ NAs
length(which(is.na(all$BsmtQual) & is.na(all$BsmtCond) & is.na(all$BsmtExposure) & is.na(all$BsmtFinType1) & is.na(all$BsmtFinType2)))
## [1] 79
#Find the additional NAs; BsmtFinType1 is the one with 79 NAs
all[!is.na(all$BsmtFinType1) & (is.na(all$BsmtCond)|is.na(all$BsmtQual)|is.na(all$BsmtExposure)|is.na(all$BsmtFinType2)), c('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2')]
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
## 333 Gd TA No GLQ <NA>
## 949 Gd TA <NA> Unf Unf
## 1488 Gd TA <NA> Unf Unf
## 2041 Gd <NA> Mn GLQ Rec
## 2186 TA <NA> No BLQ Unf
## 2218 <NA> Fa No Unf Unf
## 2219 <NA> TA No Unf Unf
## 2349 Gd TA <NA> Unf Unf
## 2525 TA <NA> Av ALQ Unf
So altogether, it seems as if there are 79 houses without a basement, because the basement variables of the other houses with missing values are all 80% complete (missing 1 out of 5 values). I am going to impute the modes to fix the other 9 houses that only have one NA value each.
#Imputing modes.
all$BsmtFinType2[333] <- names(sort(-table(all$BsmtFinType2)))[1]
all$BsmtExposure[c(949, 1488, 2349)] <- names(sort(-table(all$BsmtExposure)))[1]
all$BsmtCond[c(2041, 2186, 2525)] <- names(sort(-table(all$BsmtCond)))[1]
all$BsmtQual[c(2218, 2219)] <- names(sort(-table(all$BsmtQual)))[1]
Now that the 5 variables considered agree upon 79 houses with ‘no basement’, I am going to factorize/hot encode them below.
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches
NA No Basement
An ordinal variable so we can make use of the Qualities vector.
all$BsmtQual[is.na(all$BsmtQual)] <- 'None'
all$BsmtQual<-as.integer(revalue(all$BsmtQual, Qualities))
table(all$BsmtQual)
##
## 0 2 3 4 5
## 79 88 1285 1209 258
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
An ordinal variable so we can make use of the Qualities vector.
all$BsmtCond[is.na(all$BsmtCond)] <- 'None'
all$BsmtCond<-as.integer(revalue(all$BsmtCond, Qualities))
table(all$BsmtCond)
##
## 0 1 2 3 4
## 79 5 104 2609 122
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Mimimum Exposure
No No Exposure
NA No Basement
An ordinal variable.
all$BsmtExposure[is.na(all$BsmtExposure)] <- 'None'
Exposure <- c('None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4)
all$BsmtExposure<-as.integer(revalue(all$BsmtExposure, Exposure))
table(all$BsmtExposure)
##
## 0 1 2 3 4
## 79 1907 239 418 276
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement
Another ordinal variable.
all$BsmtFinType1[is.na(all$BsmtFinType1)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)
all$BsmtFinType1<-as.integer(revalue(all$BsmtFinType1, FinType))
table(all$BsmtFinType1)
##
## 0 1 2 3 4 5 6
## 79 851 154 288 269 429 849
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement
Another ordinal variable.
all$BsmtFinType2[is.na(all$BsmtFinType2)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)
all$BsmtFinType2<-as.integer(revalue(all$BsmtFinType2, FinType))
table(all$BsmtFinType2)
##
## 0 1 2 3 4 5 6
## 79 2494 87 105 68 52 34
Remaining Basement variabes with just a few NAs
I now still have to deal with those 6 variables that have 1 or 2 NAs.
#display remaining NAs. Using BsmtQual as a reference for the 79 houses without basement agreed upon earlier
all[(is.na(all$BsmtFullBath)|is.na(all$BsmtHalfBath)|is.na(all$BsmtFinSF1)|is.na(all$BsmtFinSF2)|is.na(all$BsmtUnfSF)|is.na(all$TotalBsmtSF)), c('BsmtQual', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF')]
## BsmtQual BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 2121 0 NA NA NA NA NA
## 2189 0 NA NA 0 0 0
## TotalBsmtSF
## 2121 NA
## 2189 0
It should be obvious that those remaining NAs all refer to ‘not present’. Below, I am fixing those remaining variables.
BsmtFullBath: Basement full bathrooms
An integer variable.
all$BsmtFullBath[is.na(all$BsmtFullBath)] <-0
table(all$BsmtFullBath)
##
## 0 1 2 3
## 1707 1172 38 2
BsmtHalfBath: Basement half bathrooms
An integer variable.
all$BsmtHalfBath[is.na(all$BsmtHalfBath)] <-0
table(all$BsmtHalfBath)
##
## 0 1 2
## 2744 171 4
BsmtFinSF1: Type 1 finished square feet
An integer variable.
all$BsmtFinSF1[is.na(all$BsmtFinSF1)] <-0
BsmtFinSF2: Type 2 finished square feet
An integer variable.
all$BsmtFinSF2[is.na(all$BsmtFinSF2)] <-0
BsmtUnfSF: Unfinished square feet of basement area
An integer variable.
all$BsmtUnfSF[is.na(all$BsmtUnfSF)] <-0
TotalBsmtSF: Total square feet of basement area
An integer variable.
all$TotalBsmtSF[is.na(all$TotalBsmtSF)] <-0
MasVnrType has 24 NAs and MasVnrArea has 23 NAs.
length(which(is.na(all$MasVnrType) & is.na(all$MasVnrArea)))
## [1] 23
All 23 of MasVnrArea have NAs. I will deal with these first and then look into the one leftover MasVnrType.
MasVnrArea: Masonry veneer area in square feet
An NA in MasVnrArea implies that the house just doesn’t have a masonry veneer. I will impute the missing values with zero.
all$MasVnrArea[is.na(all$MasVnrArea)] <- 0
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
Not ordinal. Will convert into factor.
But first, let’s find the house that has a veneer area but no veneer type.
kable(all[all$MasVnrArea != 0 & is.na(all$MasVnrType), c('MasVnrArea', 'MasVnrType')])
| MasVnrArea | MasVnrType | |
|---|---|---|
| 2611 | 198 | NA |
Since we have no way of knowing which type of masonry this house had, let’s impute with the most common.
all$MasVnrType[2611] <- names(sort(-table(all$MasVnrType)))[2] #taking the 2nd value as the 1st is 'none'
Because MasVnrType is not ordinal it can be converted into a factor.
all$MasVnrType[is.na(all$MasVnrType)] <- 'None'
table(all$MasVnrType)
##
## BrkCmn BrkFace None Stone
## 25 880 1765 249
all$MasVnrType <- as.factor(all$MasVnrType)
ggplot(all, aes(x=MasVnrType, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
A factor variable which has four NAs.
#imputing the mode
all$MSZoning[is.na(all$MSZoning)] <- names(sort(-table(all$MSZoning)))[1]
all$MSZoning <- as.factor(all$MSZoning)
table(all$MSZoning)
##
## C (all) FV RH RL RM
## 25 139 26 2269 460
sum(table(all$MSZoning))
## [1] 2919
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
Factor variable with two NAs.
#imputing the mode
all$Utilities[is.na(all$Utilities)] <- names(sort(-table(all$Utilities)))[1]
all$Utilities <- as.factor(all$Utilities)
table(all$Utilities)
##
## AllPub NoSeWa
## 2918 1
sum(table(all$Utilities))
## [1] 2919
All houses but one have all public utilities.
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
An ordinal factor with two NAs.
I will impute the NAs with the most common functionality, Typ.
all$Functional[is.na(all$Functional)]<-names(sort(-table(all$Functional)))[1]
functional <- c('Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7)
all$Functional <- as.integer(revalue(all$Functional, functional))
## The following `from` values were not present in `x`: Sal
table(all$Functional)
##
## 1 2 3 4 5 6 7
## 2 9 19 35 70 65 2719
sum(table(all$Functional))
## [1] 2919
Exterior1st
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Both variables are factor variables with only one NA. I will simply impute with the mode.
#1st
all$Exterior1st[is.na(all$Exterior1st)]<-names(sort(-table(all$Exterior1st)))[1]
table(all$Exterior1st)
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd Plywood
## 44 2 6 87 2 126 442 1 450 221
## Stone Stucco VinylSd Wd Sdng WdShing
## 2 43 1026 411 56
sum(table(all$Exterior1st))
## [1] 2919
#2nd
all$Exterior2nd[is.na(all$Exterior2nd)]<-names(sort(-table(all$Exterior2nd)))[1]
table(all$Exterior2nd)
##
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd Other
## 38 4 22 47 3 126 406 15 447 1
## Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
## 270 6 47 1015 391 81
sum(table(all$Exterior2nd))
## [1] 2919
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
Almost ordinal, but the mixed option throws it off. Only one NA so I will impute the mode.
all$Electrical[is.na(all$Electrical)]<-names(sort(-table(all$Electrical)))[1]
all$Electrical<-as.factor(all$Electrical)
table(all$Electrical)
##
## FuseA FuseF FuseP Mix SBrkr
## 188 50 8 1 2672
sum(table(all$Electrical))
## [1] 2919
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
One NA. Will impute with the mode. Ordinal so we can use the qualities vector.
all$KitchenQual[is.na(all$KitchenQual)]<-names(sort(-table(all$KitchenQual)))[1]
all$KitchenQual<-as.integer(revalue(all$KitchenQual, Qualities))
## The following `from` values were not present in `x`: None, Po
table(all$KitchenQual)
##
## 2 3 4 5
## 70 1493 1151 205
sum(table(all$KitchenQual))
## [1] 2919
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
Factor variable with one NA. I will impute with the mode.
all$SaleType[is.na(all$SaleType)]<-names(sort(-table(all$SaleType)))[1]
all$SaleType<-as.factor(all$SaleType)
table(all$SaleType)
##
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 87 5 26 9 8 12 239 7 2526
sum(table(all$SaleType))
## [1] 2919
I will run one final check to make sure we have fixed all NA variables.
NAcol <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
## SalePrice
## 1459
Now the only column with missing values is SalePrice, which we will be predicting.
Now that all of the NAs have been addressed, a few variables are still improperly categorized. Character variables need to be changed to factors or ordinal factors and a few numeric variables are actually factors as well. I’ll fix the character variables now. Similar to the NA section I have created a Tabset section that you can flip through.
Charcol <- names(all[,sapply(all, is.character)])
Charcol
## [1] "Street" "LandContour" "LandSlope" "Neighborhood"
## [5] "Condition1" "Condition2" "BldgType" "HouseStyle"
## [9] "RoofStyle" "RoofMatl" "Exterior1st" "Exterior2nd"
## [13] "ExterQual" "ExterCond" "Foundation" "Heating"
## [17] "HeatingQC" "CentralAir" "PavedDrive" "SaleCondition"
cat('There are', length(Charcol), 'remaining columns with character values')
## There are 20 remaining columns with character values
Street Street: Type of road access to property
Grvl Gravel
Pave Paved
Ordinal factor.
all$Street <- as.integer(revalue(all$Street, c('Grvl'=0, 'Pave'=1)))
table(all$Street)
##
## 0 1
## 12 2907
ggplot(all[!is.na(all$SalePrice),], aes(x=Street, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
PavedDrive PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
Ordinal Factor.
all$PavedDrive <- as.integer(revalue(all$PavedDrive, c('N'=0, 'P'=1, 'Y'=2)))
table(all$PavedDrive)
##
## 0 1 2
## 216 62 2641
ggplot(all[!is.na(all$SalePrice),], aes(x=PavedDrive, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
LandContour LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Factor. I checked for ordinality but level properties are actually worth less than hillside and depression.
all$LandContour <- as.factor(all$LandContour)
ggplot(all[!is.na(all$SalePrice),], aes(x=LandContour, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
LandSlope LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Factor.
all$LandSlope <- as.factor(all$LandSlope)
table(all$LandSlope)
##
## Gtl Mod Sev
## 2778 125 16
ggplot(all[!is.na(all$SalePrice),], aes(x=LandSlope, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
Names North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Factor.
all$Neighborhood <- as.factor(all$Neighborhood)
ggplot(all[!is.na(all$SalePrice),], aes(x=Neighborhood, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
theme(axis.text.x = element_text(angle = -50, hjust = 0))
Condition1 Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Factor.
all$Condition1 <- as.factor(all$Condition1)
ggplot(all[!is.na(all$SalePrice),], aes(x=Condition1, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
theme(axis.text.x = element_text(angle = -40, hjust = 0))
Condition2 Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
all$Condition2 <- as.factor(all$Condition2)
ggplot(all[!is.na(all$SalePrice),], aes(x=Condition2, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
Factor.
all$BldgType <- as.factor(all$BldgType)
ggplot(all[!is.na(all$SalePrice),], aes(x=BldgType, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
HouseStyle 1Story One story 1.5Fin One and one-half story: 2nd level finished 1.5Unf One and one-half story: 2nd level unfinished 2Story Two story 2.5Fin Two and one-half story: 2nd level finished 2.5Unf Two and one-half story: 2nd level unfinished SFoyer Split Foyer SLvl Split Level Factor
all$HouseStyle <- as.factor(all$HouseStyle)
ggplot(all[!is.na(all$SalePrice),], aes(x=HouseStyle, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
RoofStyle RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
Factor.
all$RoofStyle <- as.factor(all$RoofStyle)
ggplot(all[!is.na(all$SalePrice),], aes(x=RoofStyle, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
RoofMatl RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Factor.
all$RoofMatl <- as.factor(all$RoofMatl)
ggplot(all[!is.na(all$SalePrice),], aes(x=RoofMatl, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
Exterior1st Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Factor.
all$Exterior1st <- as.factor(all$Exterior1st)
ggplot(all[!is.na(all$SalePrice),], aes(x=Exterior1st, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
theme(axis.text.x = element_text(angle = -40, hjust = 0))
Exterior2nd Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Factor.
all$Exterior2nd <- as.factor(all$Exterior2nd)
ggplot(all[!is.na(all$SalePrice),], aes(x=Exterior2nd, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
theme(axis.text.x = element_text(angle = -40, hjust = 0))
ExterQual ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ordinal Factor.
all$ExterQual<-as.integer(revalue(all$ExterQual, Qualities))
## The following `from` values were not present in `x`: None, Po
ggplot(all[!is.na(all$SalePrice),], aes(x=ExterQual, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
ExterCond ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ordinal Factor.
all$ExterCond<-as.integer(revalue(all$ExterCond, Qualities))
## The following `from` values were not present in `x`: None
ggplot(all[!is.na(all$SalePrice),], aes(x=ExterCond, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Contrete
Slab Slab
Stone Stone
Wood Wood
Factor.
all$Foundation <- as.factor(all$Foundation)
ggplot(all[!is.na(all$SalePrice),], aes(x=Foundation, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
Heating Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
Factor.
all$Heating <- as.factor(all$Heating)
ggplot(all[!is.na(all$SalePrice),], aes(x=Heating, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
HeatingQC HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ordinal Factor.
all$HeatingQC<-as.integer(revalue(all$HeatingQC, Qualities))
## The following `from` values were not present in `x`: None
ggplot(all[!is.na(all$SalePrice),], aes(x=HeatingQC, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
CentralAir CentralAir: Central air conditioning
N No
Y Yes
Ordinal Factor.
all$CentralAir<-as.integer(revalue(all$CentralAir, c('N'=0,'Y'=1)))
ggplot(all[!is.na(all$SalePrice),], aes(x=CentralAir, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
SaleCondition: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
Factor.
all$SaleCondition <- as.factor(all$SaleCondition)
ggplot(all[!is.na(all$SalePrice),], aes(x=SaleCondition, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
### Double check character variables
Charcol <- names(all[,sapply(all, is.character)])
Charcol
## character(0)
cat('There are', length(Charcol), 'remaining columns with character values')
## There are 0 remaining columns with character values
A few variables that seem to be numeric at first glance are actually factors or ordinal factors. Let’s fix that.
Numcol <- names(all[,sapply(all, is.numeric)])
Numcol
## [1] "MSSubClass" "LotFrontage" "LotArea" "Street"
## [5] "LotShape" "OverallQual" "OverallCond" "YearBuilt"
## [9] "YearRemodAdd" "MasVnrArea" "ExterQual" "ExterCond"
## [13] "BsmtQual" "BsmtCond" "BsmtExposure" "BsmtFinType1"
## [17] "BsmtFinSF1" "BsmtFinType2" "BsmtFinSF2" "BsmtUnfSF"
## [21] "TotalBsmtSF" "HeatingQC" "CentralAir" "X1stFlrSF"
## [25] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [29] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [33] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [37] "Fireplaces" "FireplaceQu" "GarageYrBlt" "GarageFinish"
## [41] "GarageCars" "GarageArea" "GarageQual" "GarageCond"
## [45] "PavedDrive" "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [49] "X3SsnPorch" "ScreenPorch" "PoolArea" "PoolQC"
## [53] "MiscVal" "MoSold" "YrSold" "SalePrice"
The month a house is sold is a factor. Houses do not get more expensive in December (month 12) compared to January (month 1), or vice versa.
all$MoSold <- as.factor(all$MoSold)
ggplot(all[!is.na(all$MoSold),], aes(x=MoSold, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
### YrSold: Year Sold (YYYY)
Same situation for year sold. Factor. Generally, house prices go up, but the relationship is not completely clear. As this dataset only contains houses sold within a time span of five years which also includes a major financial crisis, factor makes the most sence in this scenario.
all$YrSold <- as.factor(all$YrSold)
ggplot(all[!is.na(all$YrSold),], aes(x=YrSold, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
### MSSubClass: Identifies the type of dwelling involved in the
sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSSubClass is coded with a number but is really a factor.
all$MSSubClass <- as.factor(all$MSSubClass)
ggplot(all[!is.na(all$MSSubClass),], aes(x=MSSubClass, y=SalePrice)) +
geom_bar(stat='summary', fun=median, fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).
There are four bathroom variables. By themselves they are not very predictive but I suspect if they were combined they could add value.
There are full baths and half baths. I created a Total Bathrooms variable which adds the number of bathrooms together, with half baths receiving a value of 0.5 and full baths receiving a value of 1.0.
BsmtFullBath, BsmtHalfBath, FullBath, HalfBath
all$TotalBathrooms <- all$BsmtFullBath + all$BsmtHalfBath*(0.5) + all$FullBath + all$HalfBath*(0.5)
tb1 <- ggplot(data=all[!is.na(all$SalePrice),], aes(x=as.factor(TotalBathrooms), y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
tb2 <- ggplot(data=all, aes(x=as.factor(TotalBathrooms))) +
geom_histogram(stat='count')
grid.arrange(tb1, tb2)
## `geom_smooth()` using formula = 'y ~ x'
Let’s create a variable for how old the house is at the time of sale. We could simply subtract YearBuilt from YrSold, but I would like to incorporate when houses have been remodeled. If a house is remodeled, then I set HouseAge equal to YrSold - YearRemodAdd and if the house was not remodeled then I set HouseAge equal to YrSold - YearBuilt
# Change to numeric so we can do math
all$YrSold <- as.numeric(as.character(all$YrSold))
all$YearBuilt <- as.numeric(as.character(all$YearBuilt))
all$YearRemodAdd <- as.numeric(as.character(all$YearRemodAdd))
all <- all %>%
mutate(HouseAge = case_when(
YearRemodAdd == YearBuilt ~ YrSold - YearBuilt,
.default = YrSold - YearRemodAdd
))
To go along with some houses being remodeled, it will be beneficial to have a variable that flags which houses have been remodeled.
all <- all %>%
mutate(RemodeledFlag = case_when(
YearRemodAdd == YearBuilt ~ 'No',
.default = 'Yes'
))
all$RemodeledFlag <- as.factor(all$RemodeledFlag)
# Change back to factor
all$YrSold <- as.factor(all$YrSold)
nb1 <- ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice)) +
geom_bar(stat='summary', fun.y = "median", fill='blue') + labs(x='Neighborhood', y='Median SalePrice') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
## Warning in geom_bar(stat = "summary", fun.y = "median", fill = "blue"):
## Ignoring unknown parameters: `fun.y`
nb2 <- ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=mean), y=SalePrice)) +
geom_bar(stat='summary', fun.y = "mean", fill='blue') + labs(x='Neighborhood', y="Mean SalePrice") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
## Warning in geom_bar(stat = "summary", fun.y = "mean", fill = "blue"): Ignoring
## unknown parameters: `fun.y`
grid.arrange(nb1, nb2)
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
There are three neighborhoods that clearly have a higher mean and median
than the other neighborhoods. This is mirrored at the bottom of the
range with the three poorest neighborhoods. I will bin these three into
an ordinal factor.
all$RichPoor[all$Neighborhood %in% c('StoneBr', 'NridgHt', 'NoRidge')] <- 2
all$RichPoor[!all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale', 'StoneBr', 'NridgHt', 'NoRidge')] <- 1
all$RichPoor[all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale')] <- 0
table(all$RichPoor)
##
## 0 1 2
## 160 2471 288
There is no variable for the total square footage of the house. Let’s create one which adds up the above ground and below ground square footage.
all$TotalSqFeet <- all$GrLivArea + all$TotalBsmtSF
ggplot(data=all[!is.na(all$SalePrice),], aes(x=TotalSqFeet, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_text_repel(aes(label = ifelse(all$GrLivArea[!is.na(all$SalePrice)]>4500, rownames(all), '')))
## `geom_smooth()` using formula = 'y ~ x'
There’s two very clear outliers. I will deal with this in the next
section.
All of the porch and deck variables are split up and won’t contribute much to the model by themselves, although I suspect they could add value if they were all added together.
all$DeckPorch <- all$WoodDeckSF + all$OpenPorchSF + all$EnclosedPorch + all$X3SsnPorch + all$ScreenPorch
ggplot(data=all[!is.na(all$SalePrice),], aes(x=DeckPorch, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
## `geom_smooth()` using formula = 'y ~ x'
Some of the variables in the dataset are co-linear. I am dropping a variable if two variables are highly correlated. To find these correlated pairs, I have used the correlations matrix again (see section 6.1). For instance: GarageCars and GarageArea have a correlation of 0.89. Of those two, I am dropping the variable with the lowest correlation with SalePrice (which is GarageArea with a SalePrice correlation of 0.62. GarageCars has a SalePrice correlation of 0.64).
dropVars <- c('YearRemodAdd', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'TotalBsmtSF', 'TotalRmsAbvGrd', 'BsmtFinSF1')
all <- all[,!(names(all) %in% dropVars)]
There are two huge houses that didn’t sell for a lot of money. I will remove them from the dataset.
all <- all[-c(524, 1299),]
Before modeling I need to center and scale the ‘true numeric’ predictors (so not variables that have been label encoded), and create dummy variables for the categorical predictors. Below, I am splitting the dataframe into one with all (true) numeric variables, and another dataframe holding the (ordinal) factors.
numericVarNames <- numericVarNames[!(numericVarNames %in% c('MSSubClass', 'MoSold', 'YrSold', 'SalePrice', 'OverallQual', 'OverallCond'))] #numericVarNames was created before having done anything
numericVarNames <- append(numericVarNames, c('HouseAge', 'DeckPorch', 'TotalBathrooms', 'TotalSqFeet'))
DFnumeric <- all[, names(all) %in% numericVarNames]
DFfactors <- all[, !(names(all) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'SalePrice']
cat('There are', length(DFnumeric), 'numeric variables, and', length(DFfactors), 'factor variables')
## There are 30 numeric variables, and 49 factor variables
# str(DFnumeric)
Skewness Skewness is a measure of the symmetry in a distribution. A symmetrical dataset will have a skewness equal to 0. So, a normal distribution will have a skewness of 0. Skewness essentially measures the relative size of the two tails. As a rule of thumb, skewness should be between -1 and 1. In this range, data are considered fairly symmetrical. In order to fix the skewness, I am taking the log for all numeric predictors with an absolute skew greater than 0.8 (actually: log+1, to avoid division by zero issues).
for(i in 1:ncol(DFnumeric)){
if (abs(skew(DFnumeric[,i]))>0.8){
DFnumeric[,i] <- log(DFnumeric[,i] +1)
}
}
Normalizing the data
PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
print(PreNum)
## Created from 2917 samples and 30 variables
##
## Pre-processing:
## - centered (30)
## - ignored (0)
## - scaled (30)
DFnorm <- predict(PreNum, DFnumeric)
dim(DFnorm)
## [1] 2917 30
In order to change the factor variables into numerical variables so the predictive model works correctly, I will create ‘Dummy’ variables for each factor.
DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
dim(DFdummies)
## [1] 2917 204
We should remove variables with no or few observations that will have little to no affect on the model.
#check if some values are absent in the test set
ZerocolTest <- which(colSums(DFdummies[(nrow(all[!is.na(all$SalePrice),])+1):nrow(all),])==0)
colnames(DFdummies[ZerocolTest])
## [1] "UtilitiesNoSeWa" "Condition2RRAe" "Condition2RRAn"
## [4] "Condition2RRNn" "HouseStyle2.5Fin" "RoofMatlMembran"
## [7] "RoofMatlMetal" "RoofMatlRoll" "Exterior1stImStucc"
## [10] "Exterior1stStone" "Exterior2ndOther" "HeatingOthW"
## [13] "ElectricalMix" "MiscFeatureTenC"
DFdummies <- DFdummies[,-ZerocolTest] #removing predictors
#check if some values are absent in the train set
ZerocolTrain <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])==0)
colnames(DFdummies[ZerocolTrain])
## [1] "MSSubClass150"
DFdummies <- DFdummies[,-ZerocolTrain] #removing predictor
Also taking out variables with less than 10 ‘ones’ in the train set.
fewOnes <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])
## [1] "MSSubClass40" "LotConfigFR3" "NeighborhoodBlueste"
## [4] "NeighborhoodNPkVill" "Condition1PosA" "Condition1RRNe"
## [7] "Condition1RRNn" "Condition2Feedr" "Condition2PosA"
## [10] "Condition2PosN" "RoofStyleMansard" "RoofStyleShed"
## [13] "RoofMatlWdShake" "RoofMatlWdShngl" "Exterior1stAsphShn"
## [16] "Exterior1stBrkComm" "Exterior1stCBlock" "Exterior2ndAsphShn"
## [19] "Exterior2ndBrk Cmn" "Exterior2ndCBlock" "Exterior2ndStone"
## [22] "FoundationStone" "FoundationWood" "HeatingGrav"
## [25] "HeatingWall" "ElectricalFuseP" "GarageTypeCarPort"
## [28] "MiscFeatureOthr" "SaleTypeCon" "SaleTypeConLD"
## [31] "SaleTypeConLI" "SaleTypeConLw" "SaleTypeCWD"
## [34] "SaleTypeOth" "SaleConditionAdjLand"
DFdummies <- DFdummies[,-fewOnes] #removing predictors
dim(DFdummies)
## [1] 2917 154
Altogether, I have removed 49 one-hot encoded predictors with little or no variance.
combined <- cbind(DFnorm, DFdummies) #combining all (now numeric) predictors into one dataframe
skew(all$SalePrice)
## [1] 1.877427
qqnorm(all$SalePrice)
qqline(all$SalePrice)
The skew of 1.87 indicates a right skew that is too high, and the Q-Q plot shows that sale prices are also not normally distributed. To fix this I am taking the log of SalePrice.
all$SalePrice <- log(all$SalePrice) #default is the natural logarithm, "+1" is not necessary as there are no 0's
skew(all$SalePrice)
## [1] 0.1213182
As you can see,the skew is now quite low and the Q-Q plot is also looking much better.
qqnorm(all$SalePrice)
qqline(all$SalePrice)
train1 <- combined[!is.na(all$SalePrice),]
test1 <- combined[is.na(all$SalePrice),]
xgb_grid = expand.grid(
nrounds = 1000,
eta = c(0.1, 0.05, 0.01),
max_depth = c(2, 3, 4, 5, 6),
gamma = 0,
colsample_bytree=1,
min_child_weight=c(1, 2, 3, 4 ,5),
subsample=1
)
The next step is to let caret find the best hyperparameter values (using 5 fold cross validation).
# set.seed(7)
# my_control <-trainControl(method="cv", number=5)
#
# xgb_caret <- train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)], method='xgbTree', trControl= my_control, tuneGrid=xgb_grid)
# xgb_caret$bestTune
As expected, this took quite a bit of time (locally). In case you are running yourself I disabled the code, and am just continuing with the results. According to caret, the ‘bestTune’ parameters are:
Below, I am starting with the preparation of the data in the recommended format.
label_train <- all$SalePrice[!is.na(all$SalePrice)]
# put our testing & training data into two seperate Dmatrixs objects
dtrain <- xgb.DMatrix(data = as.matrix(train1), label= label_train)
dtest <- xgb.DMatrix(data = as.matrix(test1))
In addition, I am taking over the best tuned values from the caret cross validation.
default_param<-list(
objective = "reg:linear",
booster = "gbtree",
eta=0.05, #default = 0.3
gamma=0,
max_depth=2, #default=6
min_child_weight=3, #default=1
subsample=1,
colsample_bytree=1
)
The next step is to do cross validation to determine the best number of rounds (for the given set of parameters).
xgbcv <- xgb.cv( params = default_param, data = dtrain, nrounds = 500, nfold = 5, showsd = T, stratified = T, print_every_n = 40, early_stopping_rounds = 10, maximize = F)
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:10.955590+0.004262 test-rmse:10.955574+0.018177
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [41] train-rmse:1.428353+0.000684 test-rmse:1.429110+0.010132
## [81] train-rmse:0.228330+0.000776 test-rmse:0.235843+0.007827
## [121] train-rmse:0.119193+0.001264 test-rmse:0.137376+0.008352
## [161] train-rmse:0.107103+0.001347 test-rmse:0.129163+0.008222
## [201] train-rmse:0.101120+0.001198 test-rmse:0.125610+0.007949
## [241] train-rmse:0.096869+0.001145 test-rmse:0.123302+0.007657
## [281] train-rmse:0.093573+0.001134 test-rmse:0.121817+0.007461
## [321] train-rmse:0.090794+0.001161 test-rmse:0.121037+0.007093
## [361] train-rmse:0.088418+0.001041 test-rmse:0.120602+0.006958
## [401] train-rmse:0.086309+0.001068 test-rmse:0.120194+0.006763
## [441] train-rmse:0.084547+0.001029 test-rmse:0.119738+0.006478
## Stopping. Best iteration:
## [448] train-rmse:0.084219+0.001038 test-rmse:0.119689+0.006504
Although it was a bit of work, the hyperparameter tuning definitly paid of, as the cross validated RMSE improved considerably (from 0.1225 without the caret tuning, to 0.1177 in this version)!
#train the model using the best iteration found by cross validation
xgb_mod <- xgb.train(data = dtrain, params=default_param, nrounds = 475)
## [16:53:17] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
XGBpred <- predict(xgb_mod, dtest)
predictions_XGB <- exp(XGBpred) #need to reverse the log to the real values
head(predictions_XGB)
## [1] 118689.8 165311.7 185160.2 188604.0 192461.5 165139.0
#view variable importance plot
library(Ckmeans.1d.dp) #required for ggplot clustering
mat <- xgb.importance (feature_names = colnames(train1),model = xgb_mod)
xgb.ggplot.importance(importance_matrix = mat[1:20], rel_to_first = TRUE)
sub_avg <- data.frame(Id = test_labels, SalePrice = round(predictions_XGB, 3))
head(sub_avg)
## Id SalePrice
## 1 1461 118689.8
## 2 1462 165311.7
## 3 1463 185160.2
## 4 1464 188604.0
## 5 1465 192461.5
## 6 1466 165139.0
write.csv(sub_avg, file = 'final_predictions.csv', row.names = F)